Abstract. We show how to use a balanced wavelet tree as a data structure
that stores a list of numbers and supports efficient range quantile
queries. A range quantile query takes a rank and the endpoints of a sublist
and returns the number with that rank in that sublist. For example,
if the rank is half the sublist's length, then the query returns the
sublist's median. We also show how these queries can be used to support
space-efficient coloured range reporting and document listing.

1 Introduction 

If we are given a list of the closing prices of a stock for the past n days and asked
to find the kth lowest price, then we can do so in O(n) time [2]. We can also
preprocess the list in O(n log n) time and store it in O(n) words such that, given
k later, we can find the answer in O(1) time: we simply sort the list. However, we
might also later face range quantile queries, which have the form "what was the
kth lowest price in the interval between the ℓth and the rth days?". Of course, we
could precompute the answers to all such queries, but storing them would take
Ω(n^3 log n) bits of space. In this paper we show how to use a balanced wavelet
tree to store the list in O(n) words such that we can answer range quantile
queries in O(log σ) time, where σ is the number of distinct items in the entire
list. We can generalize our result to any constant number of dimensions but,
currently, only by using slightly super-linear space.

We know of no previous work on quantile queries, but several authors have
written about range median queries, the special case in which k is half the
length of the interval between ℓ and r. Krizanc, Morin and Smid [12] introduced
the problem of preprocessing for median queries and gave four solutions, three
of which have worse bounds than using a balanced wavelet tree; their fourth
solution involves storing O(n^2 log log n / log n) words to answer queries in O(1)
time. Bose, Kranakis, Morin and Tang [3] then considered approximate queries,
and Har-Peled and Muthukrishnan [10] and Gfeller and Sanders [8] considered
batched queries. Recently, Krizanc et al.'s fourth solution was superseded by
one due to Petersen and Grabowski [16,17], who reduced the space bound to
O(n^2 (log log n)^2 / log^2 n) words. Table 1 shows the bounds for Krizanc et al.'s
first three solutions, for Petersen and Grabowski's solution, and for using a
balanced wavelet tree.

Har-Peled and Muthukrishnan [10] describe applications of median queries 
to the analysis of Web advertising logs. In the final section of this paper we 
show that our solution for quantile queries can be used to support coloured 
range reporting, that is, to enumerate the distinct items in a sublist. This result 
immediately improves Välimäki and Mäkinen's recent space-efficient solution to
the document listing problem [14, 19]. 

In the full version of this paper we will also discuss how to use a wavelet tree
to answer range counting queries (see [13]) and coloured range counting queries
(returning the number of distinct elements in a range without enumerating them),
and how to support updates at the cost of slowing queries down to take time
proportional to the logarithm of the largest number allowed.

2 Wavelet Trees 

Grossi, Gupta and Vitter [9] introduced wavelet trees for use in data compression, 
and Ferragina, Giancarlo and Manzini [6] showed they have myriad virtues in this 
respect. Wavelet trees are also important for compressed full-text indexing [15]. 
As we shall see, there is yet more to this intriguing data structure. 

A wavelet tree T for a sequence s of length n is an ordered, strictly binary
tree whose leaves are labelled with the distinct elements in s in order from left
to right and whose internal nodes store binary strings. The binary string at
the root contains n bits and each is set to 0 or 1 depending on whether the
corresponding character of s is the label of a leaf in T's left or right subtree.
For each internal node v of T, the subtree T_v rooted at v is itself a wavelet
tree for the subsequence of s consisting of the occurrences of its leaves' labels.
For example, if s = a, b, r, a, c, a, d, a, b, r, a and the leaves in T's left subtree are
labelled a, b and c, then the root stores 00100010010, the left subtree is a wavelet
tree for abacaaba and the right subtree is a wavelet tree for rdr. The important
properties of the wavelet tree for our purposes are summarized in the following
theorem.

Theorem 1 (Grossi et al. [9]) The wavelet tree T for a list of n elements
on an alphabet of size σ requires n log σ (1 + o(1)) bits of space, and can be
constructed in O(n log σ) time.

To see why the space bound is true, consider that the binary strings' total
length is the sum over the distinct elements of their frequencies times their
depths. Since T is balanced, every leaf has depth at most ⌈log σ⌉, so this sum
is O(n log σ) bits. The construction time bound is easy to see
from the recursive description of the wavelet tree given above.
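
The recursive description and the space argument can be checked on the running
example with a minimal Python sketch. It is an illustration only, not the
construction of [9]: the names build and total_bits are hypothetical, bit
strings are stored as plain Python lists rather than packed bits, and the
alphabet is simply split in half at every node.

```python
def build(seq, alphabet=None):
    """Build a balanced wavelet tree over a non-empty sequence (illustrative)."""
    if alphabet is None:
        alphabet = sorted(set(seq))
    if len(alphabet) == 1:
        return {"label": alphabet[0]}            # leaf: stores only its label
    mid = (len(alphabet) + 1) // 2               # split the alphabet in half
    left_syms = set(alphabet[:mid])
    return {
        # bit i is 0 if seq[i] belongs to the left subtree, 1 otherwise
        "bits": [0 if x in left_syms else 1 for x in seq],
        "left": build([x for x in seq if x in left_syms], alphabet[:mid]),
        "right": build([x for x in seq if x not in left_syms], alphabet[mid:]),
    }


def total_bits(node):
    """Total length of the binary strings stored at the internal nodes."""
    if "label" in node:
        return 0
    return len(node["bits"]) + total_bits(node["left"]) + total_bits(node["right"])


root = build("abracadabra")
print("".join(map(str, root["bits"])))   # 00100010010
print(total_bits(root))                   # 29
```

The total of 29 bits agrees with the sum of frequency times depth: a, b have
depth 3 (5·3 + 2·3 = 21) and c, d, r have depth 2 (1·2 + 1·2 + 2·2 = 8).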

We note as an aside that, while investigating data structures that support
rank and select queries, Mäkinen and Navarro [13] pointed out a connection
between wavelet trees and a data structure due to Chazelle [4] for two-dimensional
range searching on sets of points.

3 Range Quantile Queries 

We now describe how the wavelet tree can be used to answer quantile queries.
Let s be the list of n numbers we want to query. We build and store the wavelet
tree T for s and, at each internal node v, we store a small data structure that
lets us perform O(1)-time rank queries on v's binary string. A rank query on a
binary string takes a position and returns the number of 1s in the prefix that
ends at that position. Jacobson [11] and later Clark [5] showed we can support
O(1)-time rank queries on a binary string with a data structure that uses a
sublinear number of extra bits, beyond those needed to store the string itself. It
follows that the size of this preprocessed wavelet tree remains O(n log σ) bits.
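
To make the rank interface concrete, here is a deliberately simple Python
sketch. It is not the structure of Jacobson [11] or Clark [5]: it stores a full
prefix-count array, which costs Θ(n log n) bits instead of a sublinear number
of extra bits, but it answers the same queries in constant time. The class name
RankBits and its methods are hypothetical.

```python
from itertools import accumulate


class RankBits:
    """Constant-time rank over a bit list, using a plain prefix-sum array."""

    def __init__(self, bits):
        # prefix[i] = number of 1s in b[1..i] (1-based positions), prefix[0] = 0
        self.prefix = [0] + list(accumulate(bits))

    def rank1(self, i):
        """Number of 1s in the prefix b[1..i]; i = 0 denotes the empty prefix."""
        return self.prefix[i]

    def rank0(self, i):
        """Number of 0s in the prefix b[1..i]."""
        return i - self.prefix[i]


b = RankBits([0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0])   # root string for abracadabra
print(b.rank1(7), b.rank0(7))                      # 2 5
```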

Given k, ℓ and r and asked to find the kth smallest number in s[ℓ..r], we
start at the root of T and consider its binary string b. We use the two rank
queries rank_b(ℓ - 1) and rank_b(r) to find the numbers of 0s and 1s in b[1..ℓ - 1]
and b[ℓ..r]. If there are at least k copies of 0 in b[ℓ..r], then our target is a
label on one of the leaves in T's left subtree, so we set ℓ to one more than the
number of 0s in b[1..ℓ - 1], set r to the number of 0s in b[1..r], and recurse on
the left subtree. Otherwise, our target is a label on one of the leaves in T's right
subtree, so we subtract from k the number of 0s in b[ℓ..r], set ℓ to one more than
the number of 1s in b[1..ℓ - 1], set r to the number of 1s in b[1..r], and recurse
on the right subtree. When we reach a leaf, we return its label. An example is
given in Figure 1. Since T is balanced and we spend constant time at each node
as we descend (using the rank structures), our search takes O(log σ) time. Thus,
together with Theorem 1 we have the following.

Theorem 2 There exists a data structure of size O(n log σ) bits which can be
built in O(n log σ) time that answers range quantile queries on s[1..n] in O(log σ)
time.

Fig. 1. A wavelet tree T (left) for s = 6, 2, 0, 7, 9, 3, 1, 8, 5, 4, and the values (right) the
variables k, ℓ and r take on as we search for the 5th smallest element in s[3..9]. The
dashed boxes in T show the ranges from which we recursively select.
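
The descent can be sketched in Python as follows, using the query from Fig. 1
as a check. This is an illustration under simplifying assumptions, not a
succinct implementation: each node keeps a prefix-count array of its 1s in
place of a Jacobson/Clark rank structure, positions and ranks are 1-based as in
the text, and the names build and quantile are hypothetical.

```python
from itertools import accumulate


def build(seq, alphabet=None):
    """Balanced wavelet tree over a non-empty sequence; nodes are dicts."""
    if alphabet is None:
        alphabet = sorted(set(seq))
    if len(alphabet) == 1:
        return {"label": alphabet[0]}
    mid = (len(alphabet) + 1) // 2
    left_syms = set(alphabet[:mid])
    bits = [0 if x in left_syms else 1 for x in seq]
    return {
        "ones": [0] + list(accumulate(bits)),    # ones[i] = rank_b(i)
        "left": build([x for x in seq if x in left_syms], alphabet[:mid]),
        "right": build([x for x in seq if x not in left_syms], alphabet[mid:]),
    }


def quantile(node, k, l, r):
    """Return the k-th smallest element of s[l..r] (k, l, r all 1-based)."""
    while "label" not in node:
        ones = node["ones"]
        zeros_before = (l - 1) - ones[l - 1]     # 0s in b[1..l-1]
        zeros_upto_r = r - ones[r]               # 0s in b[1..r]
        zeros_in_range = zeros_upto_r - zeros_before
        if k <= zeros_in_range:                  # target lies in the left subtree
            l, r = zeros_before + 1, zeros_upto_r
            node = node["left"]
        else:                                    # target lies in the right subtree
            k -= zeros_in_range
            l, r = ones[l - 1] + 1, ones[r]
            node = node["right"]
    return node["label"]


s = [6, 2, 0, 7, 9, 3, 1, 8, 5, 4]
print(quantile(build(s), 5, 3, 9))   # 7, the 5th smallest of 0, 7, 9, 3, 1, 8, 5
```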

Some comments on σ are in order at this point. Firstly, and obviously, if σ
is constant, then so is our query time. If we represent the binary strings at each
level of the wavelet tree with the more complicated rank/select data structure of
Raman et al. [18] (instead of Clark [5]; see [9, 13]), the size of the wavelet tree is
reduced to nH_0(s) + O(n log log n / log_σ n) bits without affecting the query time,
where H_0(s) is the zeroth-order entropy of s. Prior solutions for median queries
do not make such opportunistic use of space.

At the other extreme, if σ is Ω(n) we can map the symbols in s to the range
[1..n], by first sorting the items in O(n log n) time, and storing the mapping in
O(n log σ) bits of space. Preprocessing the array this way, and then using the
wavelet tree approach above, allows us to match the Ω(n log n) time lower bound
for median queries [12], when the number of queries is O(n). This lower bound
applies to any computational model which has an Ω(n log n) time lower bound
on sorting s. Still, the solution is not completely satisfying, and we leave an open
question: Does an O(n log n) preprocessing algorithm exist that allows quantile
(or even just median) queries to be answered in o(log n) time when σ is Ω(n)?
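
The sorting-based mapping mentioned above might look like the following small
Python sketch. The function name compress is hypothetical; it maps each symbol
to its 1-based rank among the distinct symbols, which lies in [1..σ] and hence
in [1..n], and returns the sorted symbols so that answers can be mapped back.

```python
def compress(seq):
    """Replace each symbol by its rank among the distinct symbols of seq."""
    symbols = sorted(set(seq))                         # O(n log n) sorting step
    code = {x: i + 1 for i, x in enumerate(symbols)}   # symbol -> rank in [1..sigma]
    return [code[x] for x in seq], symbols


prices = [30.5, 10.0, 30.5, 20.25]
codes, symbols = compress(prices)
print(codes)                 # [3, 1, 3, 2]
print(symbols[codes[0] - 1]) # 30.5, the original symbol behind code 3
```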

It is not difficult to generalize Theorem 2 to any constant number of dimensions,
using slightly super-linear space. Suppose we are given a d-dimensional
array A of total size N^d. We build a balanced binary search tree on
the σ' distinct elements in A and, at each node v, we store a binary array of
size N^d with 1s indicating the positions of occurrences of elements in v's subtree.
We store each binary array in a folklore data structure (see, e.g., [1, Lemma 2])
that supports multidimensional range counting in O(1) time using O(m N^ε) bits,
where m is the number of 1s and ε is any positive constant; thus, we use a total
of O(N^{d+ε} log σ') bits. To find the kth smallest number in a given range in A, we
start at the root of the tree and use a range counting query to find the numbers
of 0s and 1s in the same range of the binary array stored there. If there are at
least k copies of 0 in the range, then we recurse on the left subtree; otherwise,
we subtract the number of 0s from k and recurse on the right subtree. Since we
use a single range counting query at each node as we descend, we use a total of
O(log σ') time.

Theorem 3 For any constants d and ε > 0, there exists a data structure of
size O(N^{d+ε} log σ') bits that answers d-dimensional range quantile queries on A
in O(log σ') time.
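
For d = 2 the descent can be illustrated with the Python sketch below. It makes
two loudly simplifying assumptions that are not in the text: range counting is
done with ordinary two-dimensional prefix sums rather than the folklore
structure of [1, Lemma 2], so the space used is far above the bound of the
theorem; and each internal node stores the counting array for the elements of
its left subtree, which is the only array the descent needs at that node. The
names prefix2d, count, build2d and quantile2d are hypothetical, and corners are
0-based and inclusive.

```python
def prefix2d(grid):
    """2-D prefix sums: p[i][j] = sum of grid[0..i-1][0..j-1]."""
    rows, cols = len(grid), len(grid[0])
    p = [[0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(rows):
        for j in range(cols):
            p[i + 1][j + 1] = grid[i][j] + p[i][j + 1] + p[i + 1][j] - p[i][j]
    return p


def count(p, r1, c1, r2, c2):
    """Number of 1s in rows r1..r2 and columns c1..c2 (inclusive, 0-based)."""
    return p[r2 + 1][c2 + 1] - p[r1][c2 + 1] - p[r2 + 1][c1] + p[r1][c1]


def build2d(a, values=None):
    """Balanced tree over the distinct values of the 2-D array a."""
    if values is None:
        values = sorted({x for row in a for x in row})
    if len(values) == 1:
        return {"label": values[0]}
    mid = (len(values) + 1) // 2
    small = set(values[:mid])
    # indicator over the whole array: 1 where the element is in the left subtree
    indicator = [[1 if x in small else 0 for x in row] for row in a]
    return {"small": prefix2d(indicator),
            "left": build2d(a, values[:mid]),
            "right": build2d(a, values[mid:])}


def quantile2d(node, k, r1, c1, r2, c2):
    """k-th smallest (1-based) element in the given rectangle of the array."""
    while "label" not in node:
        n_small = count(node["small"], r1, c1, r2, c2)
        if k <= n_small:
            node = node["left"]
        else:
            k -= n_small
            node = node["right"]
    return node["label"]


a = [[6, 2, 0], [7, 9, 3], [1, 8, 5]]
print(quantile2d(build2d(a), 3, 0, 1, 1, 2))   # 3, third smallest of 2, 0, 9, 3
```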

4 Application to Space Efficient Document Listing 

The algorithm for quantile queries just described can, when coupled with another
wavelet tree property, be used to enumerate the d distinct items in a given sublist
s[ℓ..r] in O(d log σ) time as follows. Let c_1, c_2, ..., c_d be the distinct elements in
s[ℓ..r] and, without loss of generality, assume c_1 < c_2 < ... < c_d. Further, let
m_i, for i ∈ 1..d, be the number of times c_i occurs in s[ℓ..r]. To enumerate the c_i, we
begin by finding c_1, which can be achieved in O(log σ) time via a quantile query, as
c_1 must be the element with rank 1 in s[ℓ..r]. Observe now that c_2 must be the
element in the range with rank m_1 + 1, and in general c_i is the element with
rank 1 + Σ_{j=1}^{i-1} m_j. Fortunately, each m_i can be determined in O(log σ) time
by exploiting a well-known property of wavelet trees, namely, their ability to
return, in O(log σ) time, the number of occurrences of a symbol in a prefix of s (see
[9]). Each m_i is the difference of two such queries.
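
The enumeration can be sketched in Python as below. As before, this is an
illustration with prefix-count arrays standing in for succinct rank
structures; the names build, quantile, occurrences and distinct are
hypothetical, positions are 1-based, and each node additionally records a
pivot (the largest symbol routed to its left subtree) so that a symbol can be
traced down the tree.

```python
from itertools import accumulate


def build(seq, alphabet=None):
    """Balanced wavelet tree over a non-empty sequence of comparable symbols."""
    alphabet = sorted(set(seq)) if alphabet is None else alphabet
    if len(alphabet) == 1:
        return {"label": alphabet[0]}
    mid = (len(alphabet) + 1) // 2
    pivot = alphabet[mid - 1]                    # largest symbol sent left
    bits = [0 if x <= pivot else 1 for x in seq]
    return {"pivot": pivot,
            "ones": [0] + list(accumulate(bits)),
            "left": build([x for x in seq if x <= pivot], alphabet[:mid]),
            "right": build([x for x in seq if x > pivot], alphabet[mid:])}


def quantile(node, k, l, r):
    """k-th smallest element of s[l..r]."""
    while "label" not in node:
        ones = node["ones"]
        z_before, z_upto = (l - 1) - ones[l - 1], r - ones[r]
        if k <= z_upto - z_before:
            l, r, node = z_before + 1, z_upto, node["left"]
        else:
            k -= z_upto - z_before
            l, r, node = ones[l - 1] + 1, ones[r], node["right"]
    return node["label"]


def occurrences(node, c, i):
    """Occurrences of c in the prefix s[1..i]; c is assumed to occur in s."""
    while "label" not in node:
        if c <= node["pivot"]:
            i, node = i - node["ones"][i], node["left"]
        else:
            i, node = node["ones"][i], node["right"]
    return i


def distinct(root, l, r):
    """Distinct elements of s[l..r] in increasing order."""
    out, k = [], 1
    while k <= r - l + 1:
        c = quantile(root, k, l, r)              # c_i has rank 1 + m_1 + ... + m_{i-1}
        out.append(c)
        k += occurrences(root, c, r) - occurrences(root, c, l - 1)   # add m_i
    return out


root = build(list("abracadabra"))
print(distinct(root, 3, 7))   # ['a', 'c', 'd', 'r'], from the sublist r, a, c, a, d
```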

The document listing problem [14] is a variation on the classical pattern 
matching problem. Instead of returning all the positions at which a pattern P 
occurs in the text T, we consider T as a collection of k documents (concatenated) 
and our task is to return the set of documents in which P occurs. 

Muthukrishnan [14], who first considered the problem, gave an O(n log n)-bit
data structure (essentially a heavily preprocessed suffix tree) that lists documents
in optimal O(|P| + ndoc) time, where ndoc is the number of documents containing
P. Recently, Välimäki and Mäkinen [19] used more modern compressed and
succinct data structures to reduce the space requirements of Muthukrishnan's
approach at the cost of slightly increasing search to O(|P| + ndoc log k) time.
Their data structure consists of three pieces: the compressed suffix array (CSA)
of T; a wavelet tree built on an auxiliary array, E (described shortly); and a
succinct range minimum query data structure [7].

Central to both Muthukrishnan's and Välimäki and Mäkinen's solutions is
the so-called "document array" E[1..n], which is parallel to the suffix array
SA[1..n]: E[i] is the document in which suffix SA[i] begins. Given an interval
SA[i..j] where all the occurrences of a pattern lie, the document listing problem
then reduces to enumerating the distinct items in E[i..j]. Without getting into
too many details, Välimäki and Mäkinen use the compressed suffix array (CSA)
of T to find the relevant sublist of E in O(|P|) time, and then a combination
of E's wavelet tree and a range minimum query data structure [7] to enumerate
the distinct items in that sublist in O(ndoc log k) time. However, as we have
described above, the wavelet tree of E alone is sufficient to solve this problem in
the same O(ndoc log k) time bound. In practice we may expect this new approach
to be faster, as the avoidance of the minimum queries should reduce CPU cache
misses. Also, because the wavelet tree of E is already present in [19] we have
reduced the size of their data structure by 2n + o(n) bits, the size of the data
structure for minimum queries.
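
A toy end-to-end version of this reduction is sketched below in Python. It
departs from the setting above in several ways, all of them assumptions made
for brevity: the suffix array is built naively by sorting suffixes and is not
compressed, the interval SA[i..j] is found by binary search over truncated
suffixes rather than in O(|P|) time with a CSA, documents are joined with a
separator "$" assumed not to occur in any document or pattern, and the wavelet
tree uses prefix-count arrays. All function names are hypothetical.

```python
from bisect import bisect_left, bisect_right
from itertools import accumulate


def build(seq, alphabet=None):
    """Balanced wavelet tree over a non-empty sequence (here: document ids)."""
    alphabet = sorted(set(seq)) if alphabet is None else alphabet
    if len(alphabet) == 1:
        return {"label": alphabet[0]}
    mid = (len(alphabet) + 1) // 2
    pivot = alphabet[mid - 1]
    bits = [0 if x <= pivot else 1 for x in seq]
    return {"pivot": pivot, "ones": [0] + list(accumulate(bits)),
            "left": build([x for x in seq if x <= pivot], alphabet[:mid]),
            "right": build([x for x in seq if x > pivot], alphabet[mid:])}


def quantile(node, k, l, r):
    while "label" not in node:
        ones = node["ones"]
        z_before, z_upto = (l - 1) - ones[l - 1], r - ones[r]
        if k <= z_upto - z_before:
            l, r, node = z_before + 1, z_upto, node["left"]
        else:
            k -= z_upto - z_before
            l, r, node = ones[l - 1] + 1, ones[r], node["right"]
    return node["label"]


def occurrences(node, c, i):
    while "label" not in node:
        if c <= node["pivot"]:
            i, node = i - node["ones"][i], node["left"]
        else:
            i, node = node["ones"][i], node["right"]
    return i


def distinct(root, l, r):
    out, k = [], 1
    while k <= r - l + 1:
        c = quantile(root, k, l, r)
        out.append(c)
        k += occurrences(root, c, r) - occurrences(root, c, l - 1)
    return out


def build_index(docs, sep="$"):
    """Return (text, suffix array, wavelet tree of the document array E)."""
    text = sep.join(docs) + sep
    doc_of = [d for d, doc in enumerate(docs, start=1) for _ in range(len(doc) + 1)]
    sa = sorted(range(len(text)), key=lambda i: text[i:])    # naive suffix array
    return text, sa, build([doc_of[i] for i in sa])          # E[i] = doc of SA[i]


def list_documents(index, pattern):
    """Ids (1-based) of the documents that contain pattern."""
    text, sa, tree = index
    keys = [text[i:i + len(pattern)] for i in sa]   # sorted, since sa is sorted
    lo, hi = bisect_left(keys, pattern), bisect_right(keys, pattern)
    return distinct(tree, lo + 1, hi) if lo < hi else []


index = build_index(["abracadabra", "cadillac", "banana"])
print(list_documents(index, "ca"))    # [1, 2]
print(list_documents(index, "ana"))   # [3]
```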